This report explores the King county house prices dataset which is available on Kaggle website. In this analysis we will try to find out on which factors is the house price dependend on.
## id date price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000 221900 3 1.00 1180
## 2 6414100192 20141209T000000 538000 3 2.25 2570
## 3 5631500400 20150225T000000 180000 2 1.00 770
## 4 2487200875 20141209T000000 604000 4 3.00 1960
## 5 1954400510 20150218T000000 510000 3 2.00 1680
## 6 7237550310 20140512T000000 1225000 4 4.50 5420
## sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1 5650 1 0 0 3 7 1180 0
## 2 7242 2 0 0 3 7 2170 400
## 3 10000 1 0 0 3 6 770 0
## 4 5000 1 0 0 5 7 1050 910
## 5 8080 1 0 0 3 8 1680 0
## 6 101930 1 0 0 3 11 3890 1530
## yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1955 0 98178 47.5112 -122.257 1340 5650
## 2 1951 1991 98125 47.7210 -122.319 1690 7639
## 3 1933 0 98028 47.7379 -122.233 2720 8062
## 4 1965 0 98136 47.5208 -122.393 1360 5000
## 5 1987 0 98074 47.6168 -122.045 1800 7503
## 6 2001 0 98053 47.6561 -122.005 4760 101930
## [1] 21613 21
Our dataset consists of 21 variables, with almost 21613 observations.
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## id date price
## Min. :1.000e+06 20140623T000000: 142 Min. : 75000
## 1st Qu.:2.123e+09 20140625T000000: 131 1st Qu.: 321950
## Median :3.905e+09 20140626T000000: 131 Median : 450000
## Mean :4.580e+09 20140708T000000: 127 Mean : 540088
## 3rd Qu.:7.309e+09 20150427T000000: 126 3rd Qu.: 645000
## Max. :9.900e+09 20150325T000000: 123 Max. :7700000
## (Other) :20833
## bedrooms bathrooms sqft_living sqft_lot
## Min. : 0.000 Min. :0.000 Min. : 290 Min. : 520
## 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040
## Median : 3.000 Median :2.250 Median : 1910 Median : 7618
## Mean : 3.371 Mean :2.115 Mean : 2080 Mean : 15107
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688
## Max. :33.000 Max. :8.000 Max. :13540 Max. :1651359
##
## floors waterfront view condition
## Min. :1.000 Min. :0.000000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.500 Median :0.000000 Median :0.0000 Median :3.000
## Mean :1.494 Mean :0.007542 Mean :0.2343 Mean :3.409
## 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :3.500 Max. :1.000000 Max. :4.0000 Max. :5.000
##
## grade sqft_above sqft_basement yr_built
## Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median : 7.000 Median :1560 Median : 0.0 Median :1975
## Mean : 7.657 Mean :1788 Mean : 291.5 Mean :1971
## 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
## Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
##
## yr_renovated zipcode lat long
## Min. : 0.0 Min. :98001 Min. :47.16 Min. :-122.5
## 1st Qu.: 0.0 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
## Median : 0.0 Median :98065 Median :47.57 Median :-122.2
## Mean : 84.4 Mean :98078 Mean :47.56 Mean :-122.2
## 3rd Qu.: 0.0 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1
## Max. :2015.0 Max. :98199 Max. :47.78 Max. :-121.3
##
## sqft_living15 sqft_lot15
## Min. : 399 Min. : 651
## 1st Qu.:1490 1st Qu.: 5100
## Median :1840 Median : 7620
## Mean :1987 Mean : 12768
## 3rd Qu.:2360 3rd Qu.: 10083
## Max. :6210 Max. :871200
##
## [1] "id" "date" "price" "bedrooms"
## [5] "bathrooms" "sqft_living" "sqft_lot" "floors"
## [9] "waterfront" "view" "condition" "grade"
## [13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
## [17] "zipcode" "lat" "long" "sqft_living15"
## [21] "sqft_lot15"
The metatdata description of the variables is not provided in Kaggle. Although some of them are easy to understand there are a few which is ambigious.
The dataset contains the following 21 variables:
id : The id of the house
date : The date when this information was taken.(dates between May 2014 and May 2015.)
price : Price of the house
bedrooms : No of bedrooms
bathrooms : No of bathrooms
sqft_living : Living area in square feet
sqft_lot : Lot area in square feet
floors : No of floors
waterfront : House has a waterfront or not view :
condition : House condition ranging from 1 to 5
grade : House grade ranging from 1 to 13
sqft_above: Living area excluding the basement sqft_basement : Basement area yr_built : The year the house was build yr_renovated : The year the house was renovated
zipcode : Zipcode of the house lat : Latitude long : Longitude
sqft_living15 : The average house square footage of the 15 closest houses sqft_lot15 : The average lot square footage of the 15 closest houses
The columns id,date are redundant in this analysis so it is better to remove them.
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 221900 3 1.00 1180 5650 1 0 0
## 2 538000 3 2.25 2570 7242 2 0 0
## 3 180000 2 1.00 770 10000 1 0 0
## 4 604000 4 3.00 1960 5000 1 0 0
## 5 510000 3 2.00 1680 8080 1 0 0
## 6 1225000 4 4.50 5420 101930 1 0 0
## condition grade sqft_above sqft_basement yr_built yr_renovated zipcode
## 1 3 7 1180 0 1955 0 98178
## 2 3 7 2170 400 1951 1991 98125
## 3 3 6 770 0 1933 0 98028
## 4 5 7 1050 910 1965 0 98136
## 5 3 8 1680 0 1987 0 98074
## 6 3 11 3890 1530 2001 0 98053
## lat long sqft_living15 sqft_lot15
## 1 47.5112 -122.257 1340 5650
## 2 47.7210 -122.319 1690 7639
## 3 47.7379 -122.233 2720 8062
## 4 47.5208 -122.393 1360 5000
## 5 47.6168 -122.045 1800 7503
## 6 47.6561 -122.005 4760 101930
Preparing the dataset
Adding and modifying columns to be used later.
Creating a categorical variable for price and assigning them to “cheap”,“moderate”,“high”,“expensive”,“max”. Creating categorical variable for the year_built column. Created 5 categories for this.
Converting the varibles waterfront,condition,floors,grade,bedrooms into factor variables.
## 'data.frame': 21613 obs. of 21 variables:
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : Factor w/ 13 levels "0","1","2","3",..: 4 4 3 5 4 5 4 4 4 4 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
## $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : Factor w/ 12 levels "1","3","4","5",..: 6 6 5 6 7 10 6 6 6 6 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## $ price_cat : Factor w/ 5 levels "Cheap","Moderate",..: 2 3 1 3 3 4 2 2 2 2 ...
## $ yr_builtR : chr "1951-1975" "1951-1975" "1926-1950" "1951-1975" ...
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 221900 3 1.00 1180 5650 1 0 0
## 2 538000 3 2.25 2570 7242 2 0 0
## 3 180000 2 1.00 770 10000 1 0 0
## 4 604000 4 3.00 1960 5000 1 0 0
## 5 510000 3 2.00 1680 8080 1 0 0
## 6 1225000 4 4.50 5420 101930 1 0 0
## condition grade sqft_above sqft_basement yr_built yr_renovated zipcode
## 1 3 7 1180 0 1955 0 98178
## 2 3 7 2170 400 1951 1991 98125
## 3 3 6 770 0 1933 0 98028
## 4 5 7 1050 910 1965 0 98136
## 5 3 8 1680 0 1987 0 98074
## 6 3 11 3890 1530 2001 0 98053
## lat long sqft_living15 sqft_lot15 price_cat yr_builtR
## 1 47.5112 -122.257 1340 5650 Moderate 1951-1975
## 2 47.7210 -122.319 1690 7639 High 1951-1975
## 3 47.7379 -122.233 2720 8062 Cheap 1926-1950
## 4 47.5208 -122.393 1360 5000 High 1951-1975
## 5 47.6168 -122.045 1800 7503 High 1976-2000
## 6 47.6561 -122.005 4760 101930 Expensive 2001-2015
Libraries used:
Out of the houses in King county a major portion of them are of 3 bedrooms which is ideal for small families.
There is a gradual increase in the number of houses built across the years with the maximum number of houses built in 2014.
Most of the houses have a condition of 3 in King County.
The most common bathrooms are 1 full bath and 2 full bath and 1 powder room in the houses in king county.
The most common grade is 7 in King county with grades 1,2,3,13 being nonexistent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1427 1910 2080 2550 13540
The median sqft_living is 1910 sq feet with the mean being 2080 sq feet.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520 5040 7618 15107 10688 1651359
The median sqft_lot is 7618 sq feet with the mean being 15107 sq feet. The max lot size is quite huge compared to the the other houses and may be considered as an outlier.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75000 321950 450000 540088 645000 7700000
The median price of the houses in king county is around 45000.
The density distribution gives a plot that is right skewed.
When the price is change to log of price the distribution changes to normal distrbution from a skewed distribution.
Density distribution for sqft_living of the houses and it is right skewed.
## zipcode count city
## 43 98103 602 Seattle
## 24 98038 590 Maple Valley
## 50 98115 583 Seattle
## 29 98052 574 Redmond
## 52 98117 553 Seattle
These 5 zipcodes have the highest number of houses and the top 3 cities are Seattle, Maple valley and Redmond.
There are 21613 houses in the dataset with 21 features. The variables condition, grade, and view, waterfront are ordered factor variables. From the analysis done it seems that the most common condition is 3. The median price of the houses is 45000,the most common bedroom is 3,the most common grade is 7. As it is not mentioned in the metatdata which of the values of the factor variables is better it has to be found out later in bivariate plots.
The main features in the data set are price and sqft_living. I’d like to determine which features are best for predicting the price of a houses and which features have a strong correlation with price. I suspect sqft_living,sqft_lot, bedrooms and grade can help to build a predictive model to price of houses.
I would like to explore condition, sqft_above, yr_build,yr_renovated and zipcode features of the dataset and find insights from the data.
Yes I created price_cat,yr_buildR to have caegorical variables for conducting furthur plots.
I will convert waterfront,condition,floors,grade into factor variables for bivariate and multivariate analysis.
By this Q-Q plots for each price category we see the variable sqft_above has a right skewed distribution.
In this correlation matrix of the price with other variables, the most dominant correlations price has is with sqft_living(0.7), grade(0.67), sqft_above(0.61), sqft_living15(0.59) and bathrooms(0.53). This was different from what I had expected. I was hoping that bedrooms,zip_code,condition and sqft_lot should have a higher correlation with price. Zipcode has a negative correlation(-0.05) which is unusual. The average sqft living of 15 nearby houses (sqft_living15) is quite relevant as in neighbourhood which has higher sqft living size, most of the houses will be of similar size and also price.
So now I will focus on these variables with higher correlation.
Now looking into the plots for those variables with higher correlation value.
Sqft_living has the highest correlation with price which is obvious and in the price vs sqft_living we see a linear relationship between the two. The regression line shows that the relationship is linear.
This plots made using ggpairs gives a lot of information at one time. Similarly sqft_above has more or less a linear with price. The density function of sqft_above is also right skewed. This correlation is the third largest among the variables selected.
As bathrooms have a strong correlation with price we plotted this to further understand the relationship. As the umber of bathroom increase the spread increase of the boxplot indicating that the price difference is increasing. Also as the number of bathrooms increase the first quantile value price also goes up giving a positive correlation with price.
Similarly this plot shows a gradual positive correlation of price with increasing grade.Hence we can figure that houses with grade 13 is the best as it is the costliest. There are a lot of outliers in the grades ranging from 7 to 11
## # A tibble: 12 x 3
## grade avgPrice count
## <fctr> <dbl> <int>
## 1 1 142000.0 1
## 2 3 205666.7 3
## 3 4 214381.0 29
## 4 5 248524.0 242
## 5 6 301919.6 2038
## 6 7 402590.3 8981
## 7 8 542852.8 6068
## 8 9 773513.2 2615
## 9 10 1071771.1 1134
## 10 11 1482857.7 398
## 11 12 2191222.0 90
## 12 13 3058181.8 11
The above is also proved by this summary table.
The floor that has the highest average price is 2.5 followed by 3.5.
Now looking into the correlation between grades and sqft above, I see there is a positive correlation.
From this plot I figured that view 4 is the best among the four.
There are a lot of outliers in this boxplot for bedrooms 3,4,5,6 which indicate that price and bedrooms are not that correlated as I had assumed before.
Relationship of price vs sqft lot is clearly not linear and hence sqft lot clearly does not affect price of houses in king county.
In this density plot the distribution of the price is studied with respect to waterfront being present or not. In the case where waterfront is not present the prices are lower with the spread being less compared to the normal distribution with waterfront where the price range is larger.
In the neighbourhood, the number of houses having basement is less than the number of housing not having basement. By comparing the average prices we see that it is higher for the houses with that have basement. It means most people live in houses that does not have basement as the prices is comparitively much higher.
In the time range of 1900-1925 houses were more expensive compared to 1926-2000. In 2000 there is again increase in the price range.
In the early 1900s(1900-1950) mostly houses were built with 1,1.5 floors. But, gradually through the years in the 2000s this has changed and mostly houses are built with 2 floors and 3.5 floors is also on the high.
By plotting condition vs the year category I found that most of the houses built in he 2000s are of condition 3 compared to the previous years where other condition houses were also built. I can assume that condition 3 signifies a mid condition that can be afforable by the middle working class.
Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?
Yes there were some relationships of interest. # Houses built in 2000s are mostly of condition 3. # Houses built in 2000s are mostly of 2 and 3.5 floors compared to 1900-1950 which were 1 and 1.5 floors
The feature of interest was price and it has a positive correlation with sqft_living,grade,sqft_above. Contrary to my belief, number of bedrooms, sqft-lot and zipcode does not have a strong correlation with price.
In the early 1900s(1900-1950) mostly houses were built with 1,1.5 floors. But, gradually through the years in the 2000s this has changed and mostly houses are built with 2 floors and 3.5 floors is also on the high.
The strongest relationship is of price vs sqft_living as it has a positive correlation of 0.7
In this plot I tried to density distribution of sqft living for each floor and found that similar to price the height for 2.5 floor is maximum and variation of price is less as compared to the other floor factors.
In this distribution the price is compared for each condition value. For the condition 2 the variation of price is less compared to the the other conditions.
The mean prices of the houses that were built in 2000 to 2012 was way lower that those renovated in that period. But there is a gradual shift and the mean price built has increased after 2012.
There is a strong correlation between sqft_living and sqft_living15. The grades from 1 to 6 has usually smaller values for sqft_living and sqft_living15 and it gradually increases for grades 7 to 13.
For the same sqft_living houses built in 1900 has higher price than the ones built in 2000s. This is very interesting observation. But houses made in larger sqft_living has increased in the 2000s.
This plot helps to understand that houses with condition 1,2 has both less range of price and sqft above compared to the other 3.
This plot of bedrooms and floors has horizontal strips. The majority of houses with 3 floors have grade 8. For floors 2, the majority is 8 and 9. For floors 1 and 1.5 the most common grade is 6 and 7.
The plots shows that lot size is quite high compared to the sqft living. the majority of houses are with 1 and 2 floors and between 2400 to 7000 sqft_living. Most of the houses are of view 0.
In this plot I am trying to figure how the house condition is distributed in the different zipcodes. The most common condition is 3 and it has the majority in most zipcodes followed by 4 condition.
This visualization predicts that the houses near the coastline are more expensive than the houses that are not.
In this plot I wanted o see how the houses were built over the years. In the Seattle region the houses were built near the coastline first. Then as time progressed people started moving towards the mainland. In the right side of the map we see most houses built after 1950s indicating newer settlemets than the left side. Also I have addded waterfront as size so we can identify houses with waterfront.
The costlier houses are built along the coastline and they are built in the 1900. The prices have remained high for those houses.
The mean prices of the houses that were built in 2000 to 2012 was way lower that those renovated in that period. But there is a gradual shift and the mean price built has increased after 2012.
I choose this as price has the as this gives us the strongest correlation with sqft_living(0.71).
The construction of houses in the cheap category has gradually reduced and nowerdays cheap category is not built at all. It is edging towards the expensive side now in the 2000s. The moderate price category was maximun in the 1951-1975 range.
The houses near the coastline is the cosliest and also built earlier in time. This indicates that people first inhabited the coastline and then in the later years more towards the interior in King County.
First for doing this project comming up with a tidy dataset with lots of features was a challenge. I did a lot of research and found this tidy dataset on Kaggle which suited my requirements. The King county house pricing dataset has 21613 observations and 21 features. I wanted to explore all the 21 features and started by plotting the individual variables in the data set. Then I looked for any interesting relationships present in the dataset mong the features. I had expected that price was positively correlated with a lot of variables. But I was wrong as value for number of bedrooms,zipcode, lot size was lot. The most important correlation is sqft of living. I found out the that condition 3 is the most prevelant among all houses. The houses were first built along the coastline and then people moved towards the mainland. I loved working with this housing prices data. It would have been great if the dataset contained day and month for the built year. I could have drawn a lot of insights and also use time series forcasting to predict for the next years to come. I would continue with price prediction later on for this dataset. I plan to do both of these things if possible with this dataset.